[XPU] add build_sampling_params op. #7738
Conversation
Thanks for your contribution!
Pull request overview
This PR adds a build_sampling_params custom op for the XPU backend, replacing the previous Python sampling-parameter padding logic with an XPU kernel, and moves the infer_seed update into the op to align with the GPU seed-stepping strategy (especially under speculative decoding).
Changes:
- New XPU build_sampling_params kernel + plugin wrapper + Paddle static op, wired into the XPU speculative verify (TARGET_MATCH) path.
- The XPU ModelRunner now computes increment_value (aligned with GPU: 4 when not speculative, (num_speculative_tokens + 1) * 4 when speculative) and adjusts when infer_seed is updated; see the sketch after this list.
- New unit test custom_ops/xpu_ops/test/test_build_sampling_params.py, which checks against the Python reference implementation and covers multiple batch shapes plus seed wrap-around.
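A minimal sketch of the stepping rule from the second bullet; the helper name here is illustrative (the PR stores this value as an attribute on the XPU ModelRunner):

```python
def seed_increment(speculative_decoding: bool, num_speculative_tokens: int) -> int:
    # GPU-aligned infer_seed stride: 4 per sampled token; a speculative step
    # can emit up to num_speculative_tokens + 1 tokens, hence (N + 1) * 4.
    return 4 if not speculative_decoding else (num_speculative_tokens + 1) * 4
```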
PR metadata check (needs completion)
- The title already carries the [XPU] tag and follows the required format.
- The "Modifications / Usage or Command / Accuracy Tests" sections of the description are not filled in. If this op can affect sampling results or reproducibility, add an accuracy comparison plus the corresponding run commands and environment info; if unit tests are omitted or XPU CI cannot run, state the reason (this PR does add a unit test file, but the description should still explain how to run it).
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| fastdeploy/worker/xpu_model_runner.py | Computes and passes down increment_value; adjusts the infer_seed update logic for the speculative case |
| fastdeploy/model_executor/layers/sample/sampler.py | The XPU verify (TARGET_MATCH) path now uses build_sampling_params and passes increment_value through |
| custom_ops/xpu_ops/test/test_build_sampling_params.py | New XPU op unit test, validated against the Python reference implementation |
| custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp | New plugin wrapper (CPU + XPU3 dispatch) |
| custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu | New Kunlun3 XPU kernel implementation |
| custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h | Exports the build_sampling_params declaration |
| custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc | New Paddle static op registration and call bridging |
```python
# 7. Updata 'infer_seed' and step_paddle()
self.share_inputs["infer_seed"].add_(self.infer_seed_increment)
self.share_inputs["infer_seed"][:] %= self.MAX_INFER_SEED
if not self.speculative_decoding:
```

```python
share_inputs["seq_lens_this_time"],
share_inputs["seq_lens_encoder"],
token_num_output_cpu=int(share_inputs["cu_seqlens_q_output"][-1]),
increment_value=increment_value,
```

```cpp
api::Context* ctx = xpu_ctx->x_context();
if (top_p.is_cpu()) {
  ctx = new api::Context(api::kCPU);
```

```cpp
// A global scratch area for per-batch start offsets is not available here,
// so each cluster computes its own pad_start with a sequential scan over
// seq_lens_this_time / seq_lens_encoder on its core 0. Because clusters run
// concurrently we cannot share a global accumulator; instead each cluster
// independently sums the first `bi` entries. This is O(bs) per cluster, but
// bs is typically small (<=512).
```
CI report generated from the code below (refreshed every 30 minutes):
1. Task overview: 1 Required task failed.
2. Task status summary
2.1 Required tasks: 7/10 passed
2.2 Optional tasks: 27/32 passed
3. Failure details (required only): Approval: PR workflow (confidence: high)
Root cause / key logs / suggested fix: request approval from FastDeploy RDs such as @qingqing01 and PaddlePaddle RDs such as @jeff41404. Related change: PR title.
Codecov Report
❌ Patch coverage is …
Additional details and impacted files
```
@@            Coverage Diff            @@
##           develop    #7738   +/-   ##
==========================================
  Coverage         ?   63.15%
==========================================
  Files            ?      461
  Lines            ?    64129
  Branches         ?     9824
==========================================
  Hits             ?    40501
  Misses           ?    20852
  Partials         ?     2776
```
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
```python
self.increment_value = (
    4 if not self.speculative_decoding else (self.speculative_config.num_speculative_tokens + 1) * 4
)
```
```
RequestFuncOutput(no=2347, request_id='None', generated_text='', reasoning_content='', success=False, latency=0.0, end_timestamp=0.0, output_tokens=0, ttft=0.0, arrival_time=[], itl=[], tpot=0.0, prompt_len=0, prompt_tokens=0, reasoning_tokens=0, res_ttft=0, error='{"error":{"message":"request[chatcmpl-814e8d96-3da8-46b0-b4da-31925c313041] generator error: Input text is too long, input_ids_len (8191) + min_tokens(1) >= max_model_len(8192), Traceback (most recent call last):\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/openai/serving_chat.py\\", line 168, in create_chat_completion\\n prompt_token_ids = await self.engine_client.format_and_add_data(current_req_dict)\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/engine_client.py\\", line 300, in format_and_add_data\\n await self.add_requests(request)\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/engine_client.py\\", line 390, in add_requests\\n raise EngineError(error_msg, error_code=400)\\nfastdeploy.utils.EngineError: Input text is too long, input_ids_len (8191) + min_tokens(1) >= max_model_len(8192)\\n","type":"invalid_request_error","param":null,"code":null}}', metrics={}, tool_calls=[], output_ids=[])
RequestFuncOutput(no=2347, request_id='None', generated_text='', reasoning_content='', success=False, latency=0.0, end_timestamp=0.0, output_tokens=0, ttft=0.0, arrival_time=[], itl=[], tpot=0.0, prompt_len=0, prompt_tokens=0, reasoning_tokens=0, res_ttft=0, error='{"error":{"message":"request[chatcmpl-799cdf97-ab7e-4823-80e4-1833bf5f7d90] generator error: Input text is too long, input_ids_len (8191) + min_tokens(1) >= max_model_len(8192), Traceback (most recent call last):\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/openai/serving_chat.py\\", line 168, in create_chat_completion\\n prompt_token_ids = await self.engine_client.format_and_add_data(current_req_dict)\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/engine_client.py\\", line 300, in format_and_add_data\\n await self.add_requests(request)\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/engine_client.py\\", line 390, in add_requests\\n raise EngineError(error_msg, error_code=400)\\nfastdeploy.utils.EngineError: Input text is too long, input_ids_len (8191) + min_tokens(1) >= max_model_len(8192)\\n","type":"invalid_request_error","param":null,"code":null}}', metrics={}, tool_calls=[], output_ids=[])
```
Force-pushed from 651d7cb to cfc5936
```diff
 _, next_tokens = top_k_top_p_sampling(
     probs,
-    top_p=top_p,
-    top_k=top_k,
+    top_p=sampling_metadata.top_p,
+    top_k=sampling_metadata.top_k,
     top_k_list=sampling_metadata.top_k_list,
-    topp_seed=topp_seed,
+    topp_seed=sampling_metadata.topp_seed,
 )
```
```diff
     sampling_metadata.seed,
-    paddle.reshape(share_inputs["seq_lens_this_time"], shape=[-1]),
-    paddle.reshape(share_inputs["seq_lens_encoder"], shape=[-1]),
+    share_inputs["seq_lens_this_time"],
+    share_inputs["seq_lens_encoder"],
+    token_num_output_cpu=int(share_inputs["cu_seqlens_q_output"][-1]),
+    increment_value=increment_value,
 )
```
```diff
-        self.share_inputs["infer_seed"][:] %= self.MAX_INFER_SEED
+        if not self.speculative_decoding:
+            self.share_inputs["infer_seed"].add_(self.infer_seed_increment)
+            self.share_inputs["infer_seed"][:] %= self.MAX_INFER_SEED
```
```cpp
int64_t pad_idx = 0;
for (int bi = 0; bi < bs; bi++) {
  bool is_decoder = (seq_lens_encoder[bi] == 0);
  int repeat = is_decoder ? seq_lens_this_time[bi] : 1;
  int64_t bi_seed = infer_seed[bi];
  for (int local_pos = 0; local_pos < repeat; local_pos++) {
    int64_t offset = is_decoder ? static_cast<int64_t>(local_pos) * 4 : 0LL;
    top_p_padding[pad_idx] = top_p[bi];
    top_k_padding[pad_idx] = top_k[bi];
    topp_seed[pad_idx] = (bi_seed + offset) % BUILD_SAMPLING_MAX_INFER_SEED;
    pad_idx++;
  }
  infer_seed[bi] =
      (infer_seed[bi] + increment_value) % BUILD_SAMPLING_MAX_INFER_SEED;
}
```
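For readability, a NumPy transcription of the same reference loop; the function name and signature here are illustrative, not the repo's actual padding_sampling_params:

```python
import numpy as np

def build_sampling_params_ref(top_p, top_k, infer_seed,
                              seq_lens_this_time, seq_lens_encoder,
                              increment_value, max_infer_seed):
    # Mirrors the CPU wrapper above: decoder batches (seq_lens_encoder == 0)
    # emit one padded entry per token with the seed stepped by 4 per position;
    # encoder batches emit a single entry with an unstepped seed.
    top_p_pad, top_k_pad, topp_seed = [], [], []
    for bi in range(len(seq_lens_this_time)):
        is_decoder = seq_lens_encoder[bi] == 0
        repeat = int(seq_lens_this_time[bi]) if is_decoder else 1
        for local_pos in range(repeat):
            offset = local_pos * 4 if is_decoder else 0
            top_p_pad.append(top_p[bi])
            top_k_pad.append(top_k[bi])
            topp_seed.append((int(infer_seed[bi]) + offset) % max_infer_seed)
        # In-place seed step, matching the kernel's post-loop update.
        infer_seed[bi] = (int(infer_seed[bi]) + increment_value) % max_infer_seed
    return (np.array(top_p_pad), np.array(top_k_pad),
            np.array(topp_seed, dtype=np.int64))
```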
```cpp
// A global scratch area for per-batch start offsets is not available here,
// so each cluster computes its own pad_start with a sequential scan over
// seq_lens_this_time / seq_lens_encoder on its core 0. Because clusters run
// concurrently we cannot share a global accumulator; instead each cluster
// independently sums the first `bi` entries. This is O(bs) per cluster, but
// bs is typically small (<=512).
for (int bi = clusterid; bi < bs; bi += nclusters) {
  if (cid == 0) {
    // Read per-batch parameters from global memory.
    float lm_top_p;
    int64_t lm_top_k;
    int64_t lm_seed;
    int lm_slt;  // seq_lens_this_time[bi]
    int lm_sle;  // seq_lens_encoder[bi]

    GM2LM_ASYNC(top_p + bi, &lm_top_p, sizeof(float));
    GM2LM_ASYNC(top_k + bi, &lm_top_k, sizeof(int64_t));
    GM2LM_ASYNC(infer_seed + bi, &lm_seed, sizeof(int64_t));
    GM2LM_ASYNC(seq_lens_this_time + bi, &lm_slt, sizeof(int));
    GM2LM(seq_lens_encoder + bi, &lm_sle, sizeof(int));  // sync barrier

    bool is_decoder = (lm_sle == 0);
    int repeat = is_decoder ? lm_slt : 1;

    // Compute pad_start = sum of token counts for batches [0, bi).
    int pad_start = 0;
    for (int k = 0; k < bi; k++) {
      int slt_k, sle_k;
      GM2LM_ASYNC(seq_lens_this_time + k, &slt_k, sizeof(int));
      GM2LM(seq_lens_encoder + k, &sle_k, sizeof(int));
      pad_start += (sle_k == 0) ? slt_k : 1;
    }
```
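The pad_start each cluster derives is just an exclusive prefix sum of per-batch token counts. A host-side sketch of the same quantity (values are illustrative):

```python
import numpy as np

seq_lens_this_time = np.array([3, 4, 5], dtype=np.int32)
seq_lens_encoder = np.array([0, 128, 0], dtype=np.int32)  # > 0 marks an encoder (prefill) batch
# Decoder batches contribute seq_lens_this_time tokens; encoder batches contribute 1.
token_counts = np.where(seq_lens_encoder == 0, seq_lens_this_time, 1)
pad_start = np.concatenate(([0], np.cumsum(token_counts)[:-1]))  # -> [0, 3, 4]
```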
| """Normal sampling for NAIVE mode on XPU.""" | ||
| top_p, top_k, topp_seed = padding_sampling_params( | ||
| sampling_metadata.top_p, | ||
| sampling_metadata.top_k, | ||
| sampling_metadata.seed, | ||
| paddle.reshape(share_inputs["seq_lens_this_time"], shape=[-1]), | ||
| paddle.reshape(share_inputs["seq_lens_encoder"], shape=[-1]), | ||
| ) | ||
| _, next_tokens = top_k_top_p_sampling( | ||
| probs, | ||
| top_p=top_p, | ||
| top_k=top_k, | ||
| top_p=sampling_metadata.top_p, | ||
| top_k=sampling_metadata.top_k, | ||
| top_k_list=sampling_metadata.top_k_list, | ||
| topp_seed=topp_seed, | ||
| topp_seed=sampling_metadata.topp_seed, | ||
| ) |
```python
share_inputs["seq_lens_this_time"],
share_inputs["seq_lens_encoder"],
token_num_output_cpu=int(share_inputs["cu_seqlens_q_output"][-1]),
increment_value=increment_value,
)
```
```diff
 # 7. Updata 'infer_seed' and step_paddle()
-self.share_inputs["infer_seed"].add_(self.infer_seed_increment)
-self.share_inputs["infer_seed"][:] %= self.MAX_INFER_SEED
+if not self.speculative_decoding:
+    self.share_inputs["infer_seed"].add_(self.infer_seed_increment)
+    self.share_inputs["infer_seed"][:] %= self.MAX_INFER_SEED
```
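A worked example of why the speculative path must skip the Python-side update (values are illustrative; num_speculative_tokens is assumed to be 3, and the modulus stands in for the kernel's BUILD_SAMPLING_MAX_INFER_SEED):

```python
MAX_INFER_SEED = 2**63 - 2  # assumption: the actual constant may differ
seed = 100
num_speculative_tokens = 3
kernel_step = (num_speculative_tokens + 1) * 4  # already applied inside build_sampling_params
seed = (seed + kernel_step) % MAX_INFER_SEED
assert seed == 116  # a second add_() in Python would double-advance the seed to 132
```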
```cpp
PD_BUILD_STATIC_OP(build_sampling_params)
    .Inputs({"top_p",
             "top_k",
             "infer_seed",
             "seq_lens_this_time",
             "seq_lens_encoder"})
    .Outputs({"top_p_padding", "top_k_padding", "topp_seed"})
    .Attrs({"token_num_output_cpu: int64_t", "increment_value: int64_t"})
    .SetKernelFn(PD_KERNEL(BuildSamplingParams))
    .SetInferShapeFn(PD_INFER_SHAPE(BuildSamplingParamsInferShape))
    .SetInferDtypeFn(PD_INFER_DTYPE(BuildSamplingParamsInferDtype));
```
| """ | ||
| Unit tests for build_sampling_params XPU op. | ||
|
|
||
| Verifies that the XPU kernel produces the same output as the Python reference | ||
| implementation (padding_sampling_params) for all cases: | ||
| - pure decoder batches (seq_lens_encoder == 0) | ||
| - pure encoder batches (seq_lens_encoder > 0) | ||
| - mixed encoder/decoder batches | ||
| - single-item batch (bs=1) | ||
| - seed wrap-around near MAX_INFER_SEED | ||
| """ |
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review
2026-05-12 14:22:53
📋 Review summary
PR overview: replaces the Python implementation of padding_sampling_params on XPU with the XPU kernel build_sampling_params, moves the infer_seed update into the kernel, and aligns the increment_value stepping with GPU (notably for speculative decoding).
Change scope: custom_ops/xpu_ops/ (new kernel + wrapper + op registration), fastdeploy/model_executor/layers/sample/sampler.py, fastdeploy/worker/xpu_model_runner.py
Impact tags: [XPU] [OP]
📝 PR convention check
The title carries the valid tag [XPU] and is well formed, but the ## Modifications and ## Usage or Command sections are empty (template comments only) and no Checklist item is ticked.
Suggested title (copy-paste ready):
[XPU][OP] Add build_sampling_params XPU kernel to replace Python padding_sampling_params
Suggested PR description (copy-paste ready; it must reproduce the full structure of the checklist §D2 template):
## Motivation
Replace the pure-Python `padding_sampling_params` implementation on XPU with the XPU kernel `build_sampling_params` to reduce host-device synchronization overhead. Additionally, move the `infer_seed` update into the kernel and align the increment_value stepping with the GPU implementation (non-speculative: 4, speculative: (num_speculative_tokens + 1) * 4) instead of XPU's previous fixed value of 4.
## Modifications
- New XPU kernel: `custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu`, implementing top_p/top_k/seed padding and the in-place infer_seed update
- New C++ wrapper: `custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp`, with both CPU and XPU3 execution paths
- New Paddle custom op registration: `custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc`
- Declare the `build_sampling_params` interface in `plugin.h`
- `sampler.py`: replace `padding_sampling_params` with `build_sampling_params` in `_verify_and_sample_xpu`; `_normal_sample_xpu` uses `sampling_metadata` fields directly
- `xpu_model_runner.py`: compute `increment_value`; keep updating `infer_seed` in Python for non-speculative decoding, and let the kernel update it for speculative decoding
- New unit test: `custom_ops/xpu_ops/test/test_build_sampling_params.py`, covering pure-decoder, pure-encoder, mixed, single-batch, and seed wrap-around cases
## Usage or Command
N/A (internal implementation swap; the external interface is unchanged)
## Accuracy Tests
INT64 modulo inside the XPU kernel verified (see screenshot in the PR); accuracy matches the original Python implementation.
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/model_executor/layers/sample/sampler.py:1080 | sampling_metadata.topp_seed does not exist; NAIVE-mode XPU sampling will crash at runtime |
Overall assessment
The new XPU kernel is cleanly designed, the CPU/XPU3 dual path is complete, and the unit tests are thorough. However, _normal_sample_xpu uses the nonexistent attribute sampling_metadata.topp_seed (SamplingMetadata only has a seed field), which will crash the non-speculative XPU path at runtime; this must be fixed before merging.
```diff
     top_k=sampling_metadata.top_k,
     top_k_list=sampling_metadata.top_k_list,
-    topp_seed=topp_seed,
+    topp_seed=sampling_metadata.topp_seed,
```
🔴 Bug: sampling_metadata.topp_seed does not exist and will raise an AttributeError.
The SamplingMetadata dataclass (meta_data.py) only has a seed field, not topp_seed. When XPU NAIVE mode (non-speculative decoding) calls _normal_sample_xpu, this line raises AttributeError: 'SamplingMetadata' object has no attribute 'topp_seed' at runtime.
Suggested fixes (pick one):
- Use the existing seed field directly (but check whether seed is already in padded form): topp_seed=sampling_metadata.seed,
- If a padded seed is needed, add topp_seed: Optional[paddle.Tensor] = None to SamplingMetadata and fill it at the SamplingMetadata(...) construction site in xpu_model_runner.py via padding_sampling_params (or the new build_sampling_params kernel).
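A minimal sketch of the second option, assuming SamplingMetadata is a dataclass; all fields except seed are placeholders for whatever meta_data.py actually defines:

```python
from dataclasses import dataclass
from typing import Optional

import paddle

@dataclass
class SamplingMetadata:
    top_p: paddle.Tensor
    top_k: paddle.Tensor
    seed: paddle.Tensor
    # New: padded per-token seed, filled on XPU at construction time by
    # padding_sampling_params or the new build_sampling_params kernel.
    topp_seed: Optional[paddle.Tensor] = None
```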
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review
2026-05-13 14:35:06
📋 Review summary
PR overview: replaces the Python implementation of padding_sampling_params on XPU with the XPU kernel build_sampling_params, moves the infer_seed update into the kernel (speculative path), and aligns the stepping value with the GPU.
Change scope: custom_ops/xpu_ops/ (new kernel/wrapper/op registration), sampler.py, xpu_model_runner.py
Impact tags: [XPU] [OP]
📝 PR convention check
The title carries the [XPU] tag and is compliant, but the ## Modifications and ## Usage or Command sections are empty (template comments only) and no ## Checklist item is ticked; please complete them per the template below.
Suggested title (copy-paste ready):
[XPU][OP] Add build_sampling_params XPU kernel to replace padding_sampling_params
Suggested PR description (copy-paste ready; it must reproduce the full structure of the checklist §D2 template):
## Motivation
Replace the Python `padding_sampling_params` implementation on XPU with the XPU kernel `build_sampling_params`, performing the sampling-parameter padding inside the kernel. Also move the `infer_seed` update into `build_sampling_params` (speculative decoding path) and align the `increment_value` stepping of `infer_seed` with the GPU implementation (a stride of 4 per token).
## Modifications
- `custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu`: new XPU3 kernel supporting mixed decoder/encoder batches; cluster 0 updates `infer_seed` in place
- `custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp`: new CPU reference implementation and XPU3 wrapper
- `custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc`: register the custom op via `PD_BUILD_STATIC_OP`, outputting `top_p_padding`, `top_k_padding`, `topp_seed`
- `custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h`: declare the `build_sampling_params` interface
- `fastdeploy/model_executor/layers/sample/sampler.py`: `_verify_and_sample_xpu` replaces `padding_sampling_params` with `build_sampling_params`; `_normal_sample_xpu` uses `sampling_metadata.topp_seed` directly; new `increment_value` parameter for `forward_xpu` / `_verify_and_sample_xpu`
- `fastdeploy/worker/xpu_model_runner.py`: new `self.increment_value` (non-speculative = 4, speculative = (N+1)*4); the speculative path drops the external `infer_seed.add_()` update in favor of the kernel-internal one
- New unit test: `custom_ops/xpu_ops/test/test_build_sampling_params.py` (6 cases: pure decoder, pure encoder, mixed, single batch, seed wrap-around, etc.)
## Usage or Command
N/A
## Accuracy Tests
INT64 modulo inside the XPU kernel verified (see the screenshot attached to the PR).
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc:28 | infer_seed is modified in place through a non-const reference but is only declared under .Inputs in PD_BUILD_STATIC_OP, with no SetInplaceMap |
| 🟡 Suggestion | custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp:145 | The wrapper only supports XPU3; other generations (e.g. XPU2) hit WRAPPER_UNIMPLEMENTED at runtime |
| ❓ Question | custom_ops/xpu_ops/ | New .cc / .cpp / .xpu source files, but no setup_ops.py or CMakeLists.txt update appears in the diff; please confirm they are wired into the build |
Overall assessment
The design is clear: sampling-parameter padding and the infer_seed update are pushed down into the XPU kernel, reducing host-side overhead, and the unit tests cover multiple scenarios. Pay attention to the Paddle custom-op framework's inplace declaration semantics and to build-registration completeness.
```cpp
std::vector<paddle::Tensor> BuildSamplingParams(
    const paddle::Tensor& top_p,
    const paddle::Tensor& top_k,
    paddle::Tensor& infer_seed,
```
🟡 Suggestion: infer_seed is passed by non-const reference and updated in place inside the kernel, but PD_BUILD_STATIC_OP declares it as a read-only input via .Inputs({"infer_seed"}) and sets no SetInplaceMap.
Under the Paddle custom-op convention, in-place modification of an input tensor must be declared to the framework through SetInplaceMap; otherwise, in AOT / static-graph scenarios the framework may create a copy of the input, making the infer_seed update invisible to callers (and the unit test's check on seed.numpy() would fail as well).
Suggested change: also list infer_seed under Outputs and declare the inplace mapping:
```cpp
PD_BUILD_STATIC_OP(build_sampling_params)
    .Inputs({"top_p", "top_k", "infer_seed", "seq_lens_this_time", "seq_lens_encoder"})
    .Outputs({"top_p_padding", "top_k_padding", "topp_seed", "infer_seed_updated"})
    .SetInplaceMap({{"infer_seed", "infer_seed_updated"}})
    ...
```
Alternatively, return the updated infer_seed as a fourth output.
```cpp
                        token_num,
                        increment_value);
  } else if (ctx->dev().type() == api::kXPU3) {
    return xpu3_wrapper(ctx,
```
🟡 Suggestion: the wrapper only implements api::kXPU3; any other XPU generation (e.g. XPU2) falls through to the trailing WRAPPER_UNIMPLEMENTED(ctx) and fails at runtime.
If this op targets XPU3 only, add a hardware-generation check at the op-registration layer or the call site so it fails early with a clear message; if XPU2 support is needed later, extend this dispatch with the corresponding branch.
CI report generated from the code below (refreshed every 30 minutes):
1. Task overview: 1 Required task failed.
2. Task status summary
2.1 Required tasks: 6/10 passed
2.2 Optional tasks: 22/26 passed
3. Failure details (required only): Approval: infrastructure (confidence: high)
Root cause / key logs / suggested fix: ask a team member with the required permission to approve this PR. Link: view logs
Motivation
Replace the Python implementation of padding_sampling_params on XPU with the XPU kernel implementation build_sampling_params. In addition, move the infer_seed update into build_sampling_params and align the increment_value stepping of infer_seed with the GPU implementation.
Modifications
Usage or Command
Accuracy Tests
Verified that INT64 modulo inside the XPU kernel behaves correctly.

Checklist
- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.